Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch
Zhang, Ziyang, Ding, Xinheng, Yuan, Jiayi, Liu, Rixin, Mao, Huizi, Xing, Jiarong, Liu, Zirui
Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs. While prior work has addressed batch-size-related nondeterminism through batch-invariant kernels, determinism across different TP sizes remains an open problem, particularly in RL settings, where the training engine typically uses Fully Sharded Data Parallel (i.e., TP = 1) while the rollout engine relies on multi-GPU TP to maximize inference throughput, creating a natural mismatch between the two. This precision mismatch can lead to suboptimal performance or even collapse in RL training. We identify and analyze the root causes of TP-induced inconsistency and propose Tree-Based Invariant Kernels (TBIK), a set of TP-invariant matrix multiplication and reduction primitives that guarantee bit-wise identical results regardless of TP size. Our key insight is to align intra- and inter-GPU reduction orders through a unified hierarchical binary tree structure. We implement these kernels in Triton and integrate them into vLLM and FSDP. Experiments confirm zero probability divergence and bit-wise reproducibility for deterministic inference across different TP sizes. We also achieve bit-wise identical results between vLLM and FSDP in RL training pipelines with different parallel strategies. Code is available at https://github.com/nanomaoli/llm_reproducibility.
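As a concrete illustration of the key insight, here is a minimal NumPy sketch (ours, not the paper's Triton kernels) of a fixed pairwise binary-tree reduction: because each TP shard reduces a contiguous subtree of the same global tree, the sharded result is bitwise identical to the single-GPU one for any power-of-two split.

```python
import numpy as np

def tree_sum(x: np.ndarray) -> np.float32:
    """Reduce x with a fixed pairwise binary tree (padded to a power of two)."""
    n = 1
    while n < x.size:
        n *= 2
    buf = np.zeros(n, dtype=np.float32)
    buf[: x.size] = x
    while buf.size > 1:
        buf = buf[0::2] + buf[1::2]  # always combine the same sibling pairs
    return buf[0]

def sharded_tree_sum(x: np.ndarray, tp: int) -> np.float32:
    """Each 'GPU' reduces its contiguous shard (a subtree of the global tree),
    then the per-shard partials are combined with the same tree structure."""
    partials = np.array([tree_sum(s) for s in np.split(x, tp)], dtype=np.float32)
    return tree_sum(partials)

x = np.random.randn(4096).astype(np.float32)
ref = sharded_tree_sum(x, tp=1)
for tp in (2, 4, 8):
    assert sharded_tree_sum(x, tp) == ref  # bitwise identical across TP sizes
print("TP-invariant tree-reduced sum:", ref)
```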
Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration
Cuyckens, Stef, Yi, Xiaoling, Geens, Robin, Dumoulin, Joren, Wiesner, Martin, Fang, Chao, Verhelst, Marian
Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms that support both training and inference. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design at both the MAC and system level and compare it to the SotA. Our integrated system achieves energy efficiencies of 657, 1438-1675, and 4065 GOPS/W for MXINT8, MXFP8/6, and MXFP4, respectively, with throughputs of 64, 256, and 512 GOPS.
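The trade-off the abstract describes can be made concrete with a small behavioral model (our illustration, not the RTL datapath): with a block-shared power-of-two scale, products stay in integer form through a balanced adder tree, and a single scale is applied at the root, deferring costly floating-point normalization out of the accumulation loop.

```python
import numpy as np

def mx_quantize(x: np.ndarray, bits: int = 8):
    """Quantize a block to a shared power-of-two scale plus narrow integers."""
    max_int = 2 ** (bits - 1) - 1
    shared_exp = int(np.ceil(np.log2(np.abs(x).max() / max_int + 1e-30)))
    scale = 2.0 ** shared_exp
    q = np.clip(np.round(x / scale), -max_int, max_int).astype(np.int32)
    return q, scale

def mx_dot(a: np.ndarray, b: np.ndarray, bits: int = 8) -> float:
    qa, sa = mx_quantize(a, bits)
    qb, sb = mx_quantize(b, bits)
    prod = qa.astype(np.int64) * qb.astype(np.int64)  # exact integer products
    while prod.size > 1:                # balanced adder (reduction) tree;
        prod = prod[0::2] + prod[1::2]  # block length assumed a power of two
    return float(prod[0]) * sa * sb     # single scale/normalize at the root

a = np.random.randn(32).astype(np.float32)
b = np.random.randn(32).astype(np.float32)
print("MXINT8-style dot:", mx_dot(a, b), " fp32 reference:", float(a @ b))
```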
DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
Yao, Jinwei, Chen, Kaiqi, Zhang, Kexun, You, Jiaxuan, Yuan, Binhang, Wang, Zeke, Lin, Tao
Given the increasing demand for tree-structured interactions with LLMs, we introduce DeFT (Decoding with Flash Tree-Attention), an IO-aware tree attention algorithm tailored for tree-structured inference. Unlike traditional sequence-based decoding, tree-structured decoding better accommodates modern task requirements, including self-consistency, few-shot prompting, multi-step reasoning, and multi-model/head coordination. However, existing sequence-based inference systems are ill-suited for tree-structured decoding, resulting in redundant computation, memory footprints, and memory access, thereby undermining inference efficiency. To address this challenge, DeFT maintains memory-efficient attention calculation with a low memory footprint through two key stages: (1) QKV Preparation: we propose a KV-Guided Grouping Strategy with Tree Split to intelligently group QKV, optimizing GPU resource utilization while minimizing memory reads/writes for the KV cache between GPU global memory and on-chip shared memory; (2) Attention Calculation: we compute partial attention for each QKV group in a fused kernel and employ a Tree-topology-aware Global Reduction strategy to obtain the final attention. By reducing KV cache IO by 73-99% and nearly 100% of the IO for partial results during attention calculation (e.g., Softmax), DeFT achieves up to a 2.52x/3.82x speedup in end-to-end/attention latency over state-of-the-art attention algorithms across three practical tree-based workloads: few-shot prompting, multi-step reasoning, and speculative decoding.
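The global reduction step can be sketched with the standard online-softmax merge used by flash-attention-style kernels; the code below is our illustration of that reduction (function names are hypothetical, not DeFT's API): each KV segment on a query's root-to-leaf path yields a partial (output, max, denominator) triple, and partials are merged with rescaling.

```python
import numpy as np

def partial_attention(q, K, V):
    """Partial attention for one KV segment: unnormalized output, max, denom."""
    s = q @ K.T
    m = s.max()
    p = np.exp(s - m)
    return p @ V, m, p.sum()

def merge(o1, m1, l1, o2, m2, l2):
    """Online-softmax combine: rescale both partials to a shared max."""
    m = max(m1, m2)
    a1, a2 = np.exp(m1 - m), np.exp(m2 - m)
    return a1 * o1 + a2 * o2, m, a1 * l1 + a2 * l2

d = 64
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
segments = [(rng.standard_normal((n, d)), rng.standard_normal((n, d)))
            for n in (16, 8, 4)]  # KV segments along one root-to-leaf path

o, m, l = partial_attention(q, *segments[0])
for K, V in segments[1:]:
    o, m, l = merge(o, m, l, *partial_attention(q, K, V))
attn = o / l  # final normalized attention output

# Reference: ordinary attention over the concatenated path.
Kc = np.vstack([K for K, _ in segments])
Vc = np.vstack([V for _, V in segments])
s = q @ Kc.T
w = np.exp(s - s.max())
assert np.allclose(attn, (w / w.sum()) @ Vc)
```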
Partial Rewriting for Multi-Stage ASR
Bruguier, Antoine, Qiu, David, He, Yanzhang
For many streaming automatic speech recognition tasks, it is important to provide timely intermediate streaming results while refining a high-quality final result. This can be done with a multi-stage architecture, where a small left-context-only model creates streaming results and a larger left- and right-context model produces the final result at the end. While this significantly improves the quality of the final results without compromising the streaming emission latency of the system, the streaming results do not benefit from the quality improvements. Here, we propose a text manipulation algorithm that merges the streaming outputs of both models. We improve the quality of streaming results by around 10% without altering the final results. Our approach introduces no additional latency and reduces flickering. It is also lightweight, does not require retraining the model, and can be applied to a wide variety of multi-stage architectures.
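The abstract does not spell out the text manipulation algorithm, but a minimal word-level splice illustrates the idea: align the refined prefix from the larger model with the streaming model's fresher hypothesis on their longest word overlap, then display the refined prefix plus the streaming tail. The sketch below is our hypothetical illustration, not the authors' algorithm.

```python
def merge_hypotheses(refined: str, streaming: str) -> str:
    """Splice the refined prefix onto the streaming tail at their overlap."""
    r, s = refined.split(), streaming.split()
    # Longest suffix of the refined words matching a span of streaming words.
    for k in range(min(len(r), len(s)), 0, -1):
        for i in range(len(s) - k + 1):
            if r[-k:] == s[i : i + k]:
                return " ".join(r + s[i + k :])  # refined prefix + fresh tail
    return " ".join(r + s)  # no overlap: concatenate conservatively

print(merge_hypotheses("the quick brown", "the quack brown fox jumps"))
# -> "the quick brown fox jumps"
```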
Improving Stability in Simultaneous Speech Translation: A Revision-Controllable Decoding Approach
Chen, Junkun, Xue, Jian, Wang, Peidong, Pan, Jing, Li, Jinyu
Simultaneous speech-to-text translation plays a critical role in real-time crosslingual communication. Despite the advancements of recent years, challenges remain in achieving stability in the translation process, a concern primarily manifested in the flickering of partial results. In this paper, we propose a novel revision-controllable method designed to address this issue. Our method introduces an allowed revision window within the beam search pruning process to screen out candidate translations likely to cause extensive revisions, leading to a substantial reduction in flickering and, crucially, providing the capability to eliminate flickering completely. Experiments demonstrate that the proposed method significantly improves decoding stability without substantially compromising translation quality.
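A hedged sketch of the revision-window idea (our illustration, not the authors' implementation): during pruning, any beam candidate whose prefix would retract more than `window` tokens of the currently displayed partial translation is screened out; setting the window to 0 forbids revisions entirely.

```python
def revisions_needed(displayed, candidate):
    """Count displayed tokens that the candidate would retract or change."""
    i = 0
    while i < min(len(displayed), len(candidate)) and displayed[i] == candidate[i]:
        i += 1
    return len(displayed) - i

def prune_beam(beam, displayed, window, beam_size):
    """beam: list of (tokens, score); keep only low-revision candidates."""
    kept = [(t, s) for t, s in beam if revisions_needed(displayed, t) <= window]
    kept.sort(key=lambda ts: ts[1], reverse=True)
    return kept[:beam_size]

displayed = ["we", "meet", "at", "noon"]
beam = [(["we", "meet", "at", "noon", "today"], -1.2),
        (["we", "met", "yesterday"], -1.0),    # would revise 3 shown tokens
        (["we", "meet", "at", "dawn"], -1.1)]  # would revise 1 shown token
print(prune_beam(beam, displayed, window=1, beam_size=2))
```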
Scoup-SMT: Scalable Coupled Sparse Matrix-Tensor Factorization
Papalexakis, Evangelos E., Mitchell, Tom M., Sidiropoulos, Nicholas D., Faloutsos, Christos, Talukdar, Partha Pratim, Murphy, Brian
How can we correlate neural activity in the human brain as it responds to words with behavioral data expressed as answers to questions about these same words? In short, we want to find latent variables that explain both the brain activity and the behavioral responses. We show that this is an instance of the Coupled Matrix-Tensor Factorization (CMTF) problem. We propose Scoup-SMT, a novel, fast, and parallel algorithm that solves the CMTF problem and produces a sparse latent low-rank subspace of the data. In our experiments, we find that Scoup-SMT is 50-100 times faster than a state-of-the-art algorithm for CMTF, along with a 5-fold increase in sparsity. Moreover, we extend Scoup-SMT to handle missing data without degradation of performance. We apply Scoup-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, coupled along the nouns dimension. Scoup-SMT is able to find meaningful latent variables and to predict brain activity with competitive accuracy. Finally, we demonstrate the generality of Scoup-SMT by applying it to a Facebook dataset (users, friends, wall postings); there, Scoup-SMT spots spammer-like anomalies.
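For readers unfamiliar with CMTF, a compact dense ALS sketch of the underlying objective follows (our illustration; Scoup-SMT itself is a sparse, parallel algorithm): the tensor X is modeled as [[A, B, C]] and the matrix Y as A D^T, with the factor A shared along the coupled nouns mode.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product, shape (U.rows * V.rows, R)."""
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def cmtf_als(X, Y, R=5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
    D = rng.standard_normal((Y.shape[1], R))
    for _ in range(iters):
        # Coupled update: A must jointly fit the mode-1 unfolding of X and Y.
        lhs = np.vstack([khatri_rao(B, C), D])
        rhs = np.hstack([X.reshape(I, -1), Y]).T
        A = np.linalg.lstsq(lhs, rhs, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C),
                            X.transpose(1, 0, 2).reshape(J, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B),
                            X.transpose(2, 0, 1).reshape(K, -1).T, rcond=None)[0].T
        D = np.linalg.lstsq(A, Y, rcond=None)[0].T
    return A, B, C, D

# Recover a rank-3 coupled model from synthetic data.
rng = np.random.default_rng(1)
A0, B0, C0, D0 = (rng.random((n, 3)) for n in (20, 15, 10, 8))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
Y = A0 @ D0.T
A, B, C, D = cmtf_als(X, Y, R=3)
print("tensor fit error:", np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)))
```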
Investigations of Continual Computation
Shahaf, Dafna (Carnegie Mellon) | Horvitz, Eric (Microsoft Research)
Autonomous agents that sense, reason, and act in real-world environments for extended periods often need to solve streams of incoming problems. Traditionally, effort is applied only to problems that have already arrived and been noted. We examine continual computation methods that allow agents to allocate time ideally between solving current problems and solving potential future problems under uncertainty. We first review prior work on continual computation. Then, we present new directions and results, including the consideration of shared subtasks and multiple tasks. We present results on the computational complexity of the continual-computation problem and provide approximations for arbitrary models of computational performance. Finally, we review special formulations for addressing uncertainty about the best algorithm to apply, learning about performance, and considering the costs associated with delayed use of results.
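Under the simple linear-value model common in this literature, the well-known policy is to spend idle time precomputing prospective problems in order of their arrival probability. The toy sketch below is our illustration of that policy under those assumptions, with hypothetical problem names, not the paper's formulation.

```python
def allocate_idle_time(problems, idle_time):
    """problems: list of (name, arrival_probability, remaining_work_seconds).
    Spend idle time on the most probable prospective problems first."""
    plan = []
    for name, p, work in sorted(problems, key=lambda x: x[1], reverse=True):
        if idle_time <= 0:
            break
        spent = min(idle_time, work)           # precompute until solved or time runs out
        plan.append((name, spent, p * spent))  # expected latency saved
        idle_time -= spent
    return plan

# Hypothetical prospective problems: (name, arrival probability, work left).
problems = [("query_a", 0.6, 5.0), ("query_b", 0.3, 5.0), ("query_c", 0.1, 5.0)]
for name, t, saving in allocate_idle_time(problems, idle_time=8.0):
    print(f"{name}: precompute {t:.0f}s, expected saving {saving:.1f}s")
```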